An alternative scheme for perplexity estimation

Authors

  • Frédéric Bimbot
  • Marc El-Bèze
  • Michèle Jardino
Abstract

Language models are usually evaluated on test texts using the perplexity derived directly from the model likelihood function. To use this measure in a comparative evaluation campaign, we have developed an alternative scheme for perplexity estimation. The method is derived from the Shannon game and is based on a gambling approach in which bets are placed on the next word of a truncated sentence. We also use the entropy bounds proposed by Shannon, which are based on the rank of the correct answer, to estimate a perplexity interval for non-probabilistic language models. The relevance of the approach is assessed on an example.
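
The two quantities mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the rank-based bounds are assumed to be the ones from Shannon's 1951 letter-guessing experiment, applied here to words. The first function computes perplexity from the probabilities a model bets on each correct next word; the other two turn the observed ranks of the correct word into an entropy interval and then a perplexity interval.

```python
import math
from collections import Counter


def perplexity_from_probs(probs):
    """Perplexity from the probability assigned to each correct next word.

    `probs` holds one value per predicted word; all values must be > 0.
    """
    log_sum = sum(math.log2(p) for p in probs)
    return 2 ** (-log_sum / len(probs))


def shannon_rank_bounds(ranks):
    """Entropy bounds in bits per word from 1-based ranks of the correct word.

    Assumed form of Shannon's bounds, with q_i the relative frequency of rank i:
      upper = -sum_i q_i * log2(q_i)
      lower =  sum_i i * (q_i - q_{i+1}) * log2(i)
    The lower bound is derived for an ideal predictor, so it is only
    meaningful when the q_i are roughly non-increasing in i.
    """
    n = len(ranks)
    counts = Counter(ranks)
    max_rank = max(ranks)
    # q[i - 1] is the frequency of rank i; the extra trailing entry is q_{max+1} = 0.
    q = [counts.get(i, 0) / n for i in range(1, max_rank + 2)]
    upper = -sum(qi * math.log2(qi) for qi in q if qi > 0)
    lower = sum(i * (q[i - 1] - q[i]) * math.log2(i) for i in range(1, max_rank + 1))
    return lower, upper


def perplexity_interval(ranks):
    """Perplexity interval obtained by exponentiating the entropy bounds."""
    lower, upper = shannon_rank_bounds(ranks)
    return 2 ** lower, 2 ** upper
```

For instance, a model that always bets probability 0.25 on the correct word has perplexity 4, and a ranking model whose correct answer falls equally often at rank 1 and rank 2 gets an interval that collapses to a perplexity of 2. Only ranks are needed for the interval, which is why it suits models that produce an ordering of candidates rather than probabilities.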

Related resources

N-gram language modeling of Japanese using bunsetsu boundaries

A new scheme of N-gram language modeling was proposed for Japanese, where word N-grams were calculated separately for the two cases: crossing and not crossing bunsetsu boundaries. Here, bunsetsu is a basic grammatical (and pronunciation) unit of Japanese. A similar scheme using accent phrase boundaries instead of bunsetsu boundaries has already been proposed by the authors with a certain succes...

Context dependent language model adaptation

Language models (LMs) are often constructed by building multiple component LMs that are combined using interpolation weights. By tuning these interpolation weights, using either perplexity or discriminative approaches, it is possible to adapt LMs to a particular task. In this work, improved LM adaptation is achieved by introducing context dependent interpolation weights. An important part of th...

Combination of random indexing based language model and n-gram language model for speech recognition

This paper presents the results and conclusion of a study on the introduction of semantic information through the Random Indexing paradigm in statistical language models used in speech recognition. Random Indexing is an alternative to Latent Semantic Analysis (LSA) that addresses the scalability problem of LSA. After a brief presentation of Random Indexing (RI), this paper describes, different ...

Kernel Density Topic Models: Visual Topics Without Visual Words

The computer vision community has greatly benefited from transferring techniques originally developed in the document processing domain to the visual domain by means of discretizing the feature space into visual words. This paper reinvestigates the necessity of this artificial discretization of the continuous space of visual features and consequently proposes an alternative formulation of th...

Use of contexts in language model interpolation and adaptation

Language models (LMs) are often constructed by building multiple individual component models that are combined using context independent interpolation weights. By tuning these weights, using either perplexity or discriminative approaches, it is possible to adapt LMs to a particular task. This paper investigates the use of context dependent weighting in both interpolation and test-time adaptatio...

Publication date: 1997